This workshop is mainly adapted from the mixOmics tutorials (http://mixomics.org) and the Bioconductor vignette (https://bioconductor.org/packages/release/bioc/html/mixOmics.html).

1 Introduction

Partial least squares discriminant analysis (PLS-DA) is a statistical method that combines the techniques of partial least squares (PLS) regression and linear discriminant analysis (LDA).

PLS-DA works by first finding the linear combinations of the predictor variables (X) that are most correlated with the response variable (Y). These linear combinations are called latent variables. The latent variables are then used to train an LDA model, which can classify new samples.

PLS-DA has several advantages over other classification methods. First, it is relatively robust to collinearity, which is a problem that can occur in data sets with many correlated variables. Second, PLS-DA can handle data sets with a large number of predictor variables. Third, PLS-DA can be used to visualize data, which can be helpful for understanding the relationships between the predictor variables and the response variable.
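
As an illustration of the collinearity point, the following sketch (simulated data, not part of the workshop dataset) fits a PLS-DA model to predictors that are near-duplicates of one another; the model fits without complaint where a plain LDA could become unstable:

```r
library(mixOmics)

# simulated example: 30 samples, 2 groups, 10 predictors
# (the second 5 predictors are near-copies of the first 5, i.e. highly collinear)
set.seed(42)
grp   <- factor(rep(c("healthy", "disease"), each = 15))
base  <- matrix(rnorm(30 * 5), nrow = 30)
X.sim <- cbind(base, base + matrix(rnorm(30 * 5, sd = 0.1), nrow = 30))
colnames(X.sim) <- paste0("feat", 1:10)

fit <- plsda(X.sim, grp, ncomp = 2)  # fits despite the collinearity
```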

2 Load example dataset

library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
data(breast.TCGA)

# Extract training data and name each data frame
# There are 3 data frames (3 omics of the same samples), stored as a list
# In each data frame, the rows represent samples.
# Columns represent feature measurements (genes, miRNA, proteins)
X <- list(mRNA = breast.TCGA$data.train$mrna, 
          miRNA = breast.TCGA$data.train$mirna, 
          protein = breast.TCGA$data.train$protein)

Y <- breast.TCGA$data.train$subtype
table(Y)
## Y
## Basal  Her2  LumA 
##    45    30    75

3 PLSDA

3.1 plsda()

We can use the plsda() function to find combinations of features that discriminate the groups.
The inputs are a matrix of values (sample-rows by feature-columns, X) together with a vector of the group labels (Y).
We also need to choose the number of latent components (starting here with ncomp = 3), and whether to apply scaling and a log-ratio transformation (as seen in Session 1 with pca()).

plsda.mRNA <- plsda(X$mRNA, Y=Y, ncomp = 3, scale = TRUE, logratio = "CLR")

# plot the result
plotIndiv(plsda.mRNA, ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA", 
          legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)

plotIndiv() applied to plsda objects projects the samples into the PLS-DA component subspace; components 1 and 2 are shown by default. We can look at other components, which might show further separation of groups in complex datasets.

plotIndiv(plsda.mRNA, comp = c(1,3), ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA", 
          legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)

  • Increasing the number of components improves discrimination up to a point, after which no further value is added.
  • Component 1 seems useful for separating LumA < Her2 < Basal
  • Component 2 seems useful for separating (LumA or Basal) < Her2
  • Component 3 is not really adding any further value in separating the groups.

PLSDA with a single component may be sufficient for some experiments (e.g. a binary group comparison of healthy vs disease samples), whereas experiments with many groups might be optimally separated using more components.
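
For example, a binary comparison like the one described above could be set up by collapsing the factor levels (a sketch reusing X and Y from Section 2; Y.binary is a name introduced here for illustration):

```r
# collapse the three subtypes into a binary outcome: Basal vs everything else
Y.binary <- factor(ifelse(Y == "Basal", "Basal", "Other"))

plsda.bin <- plsda(X$mRNA, Y = Y.binary, ncomp = 1, scale = TRUE, logratio = "CLR")
```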

3.2 Loadings

Each component is a composite of all features, each with a weight coefficient (the loadings). Loadings can be displayed using plotLoadings() and selectVar().
In PLSDA, all features are used, although the weight given to many of the features might be close to zero.

plotLoadings(plsda.mRNA, comp = 1, contrib = "max", method = "median")

plotLoadings(plsda.mRNA, comp = 2, contrib = "max", method = "median")

loadings.comp1 <- selectVar(plsda.mRNA, comp=1)
loadings.comp2 <- selectVar(plsda.mRNA, comp=2)
head(loadings.comp1$value)
##          value.var
## ZNF552  -0.1405038
## SLC43A3  0.1373226
## PREX1   -0.1280330
## FMNL2    0.1274558
## CCNA2    0.1256307
## LMO4     0.1222428
head(loadings.comp2$value)
##           value.var
## CDK18    -0.2059362
## TP53INP2  0.2032781
## TRIM45   -0.1901160
## STAT5A   -0.1774087
## PLEKHA4  -0.1667928
## ZNF37B   -0.1649502

4 Sparse PLSDA tuning

Usually only a small fraction of features contributes most of the usefulness of each component, while adding too many features gives diminishing returns or even a negative impact (due to noise). The sparse version of PLSDA (sPLSDA) is designed to select an optimal/minimal number of important features per component via tuning procedures.

General advice for choosing parameters in the tuning procedure:

  • When the dataset has <10 samples try the “loo” (leave-one-out) method of validation
  • With >10 samples, use “Mfold” and set folds = N (N up to ~10, while still allowing at least ~5 samples per group in each fold).
  • nrepeat: start with a low number for initial testing (<10) and increase it for proper tuning once all the settings are decided (>50). This can scale to become a very slow step with complicated data and many repetitions (e.g. 30 mins on a laptop). nrepeat can only be 1 for the “loo” method.
  • cpus: the number of processor cores you want to use in parallel (usually your total minus 1 is ideal).
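
The number of cores available on your machine can be checked with base R (not mixOmics-specific):

```r
parallel::detectCores()  # total logical cores; leaving one free is a common choice
```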

The next two parameters determine how the error/success rate is defined, which will depend on the complexity of the data:

  • measure (the error metric to minimise): “BER” is the balanced error rate, which is a particularly good choice for experiments with unequal group sizes. Alternatives are “overall” or “AUC” (area under the curve).
  • dist: options are “max.dist”, “centroids.dist”, or “mahalanobis.dist”, in increasing order of complexity. Usually the simpler distance options are preferable.

test.keepX <- 2^(1:7)
test.keepX
## [1]   2   4   8  16  32  64 128
set.seed(123)

tune.splsda.mRNA <- tune.splsda(X$mRNA, Y = Y, ncomp = 3, validation = "Mfold", 
                                folds = 5, test.keepX = test.keepX,
                                dist = "centroids.dist",
                                scale = TRUE, logratio = "CLR", 
                                nrepeat = 30, cpus = 3, progressBar = FALSE, 
                                measure = "BER")
  
plot(tune.splsda.mRNA)

Inspect the performance (BER), which we want to minimise, as the number of components and the number of selected features increase.

A few observations:

  • In this example, component 1 alone (blue) seems optimised at 64 genes.
  • Adding a component: 1+2 (orange) seems optimised at 12 genes.
  • Adding a component: 1+2+3 (grey) does not clearly add any further benefit.
  • So an ideal model might use two components, with 64 and 12 genes, respectively.
  • We could also do a more detailed ‘scan’ of possible values (test.keepX).
  • Considering also that fewer genes might be desirable for a minimal gene signature, we could also choose a lower number of genes with only a slight loss in performance (e.g. 32+12 genes).

test.keepX <- round(2^seq(3,6.5,0.25))
test.keepX

set.seed(123)

tune.splsda.mRNA <- tune.splsda(X$mRNA, Y = Y, ncomp = 3, validation = "Mfold", 
                                folds = 5, test.keepX = test.keepX,
                                dist = "centroids.dist",
                                scale = TRUE, logratio = "CLR", 
                                nrepeat = 60, cpus = 3, progressBar = FALSE, 
                                measure = "BER")
# cache the (slow) tuning result, so it can be reloaded without re-running
saveRDS(tune.splsda.mRNA, file = "tune.splsda.mRNA.rds")
tune.splsda.mRNA <- readRDS("tune.splsda.mRNA.rds")

plot(tune.splsda.mRNA)

# inspect the suggested optimal values of ncomp and keepX
tune.splsda.mRNA$choice.ncomp$ncomp # suggests 2 components
tune.splsda.mRNA$choice.keepX # comp1: keep 54 features, comp2: keep 23 features

5 Tuned sPLSDA

Based on the last tuning procedure, let’s make a ‘tuned’ sPLSDA model using two components (with 54 and 23 genes).

tuned.splsda.mRNA <- splsda(X$mRNA, Y = Y, ncomp = 2, keepX = c(54, 23))
  
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 54 + 23 genes", 
          ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6, 
          legend.title = "Cancer\nSubtype", legend = TRUE)

# we can re-plot with a background-shading of the group boundary calls
background = background.predict(tuned.splsda.mRNA, comp.predicted=2, dist = "centroids.dist")
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 54 + 23 genes", 
          ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6, 
          legend.title = "Cancer\nSubtype", legend = TRUE, 
          background = background)

The separation between groups looks as good as or better than in the initial PLSDA model, and requires fewer features.

6 Feature Loadings

We can inspect the top-ranked chosen genes and their loadings.

plotLoadings(tuned.splsda.mRNA, comp = 1, contrib = "max", method = "median", size.name = 0.3)

plotLoadings(tuned.splsda.mRNA, comp = 2, contrib = "max", method = "median")

loadings.comp1 <- selectVar(tuned.splsda.mRNA, comp=1)
loadings.comp2 <- selectVar(tuned.splsda.mRNA, comp=2)
head(loadings.comp1$value)
##         value.var
## ZNF552 -0.3509967
## KDM4B  -0.3197369
## PREX1  -0.2712825
## LRIG1  -0.2691817
## CCNA2   0.2566730
## TTC39A -0.2460363
head(loadings.comp2$value)
##           value.var
## TP53INP2  0.5245244
## CDK18    -0.5079990
## NDRG2    -0.3046127
## TRIM45   -0.2836764
## STAT5A   -0.2636752
## PLEKHA4  -0.2417573
# save a list of selected important features to a csv file
write.csv(rbind(data.frame(loadings.comp1), data.frame(loadings.comp2)),
          file = "splsda.mRNA.selectedFeatures.csv", row.names = FALSE)

7 Feature Stability

Stability of the selected features can be reviewed with the perf() function, which repeats the cross-validation and feature-selection process; for classification models it reports error rates (overall and balanced) for each component, together with the frequency with which each feature was selected.

Features selected with high stability/frequency are interpreted as likely more integral to the model than the other features.
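
A sketch of how this might look for the tuned model built in Section 5 (the exact output slot names can differ slightly between mixOmics versions, so treat this as a guide rather than a guarantee):

```r
set.seed(123)
perf.splsda.mRNA <- perf(tuned.splsda.mRNA, validation = "Mfold", folds = 5,
                         nrepeat = 10, progressBar = FALSE)

# selection frequency (stability) of the features chosen on component 1
head(perf.splsda.mRNA$features$stable[[1]])
```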

8 Heatmap

cim() is a plotting function in mixOmics, which can generate a basic heat map of expression (or other values, depending on context). cim is an abbreviation of Clustered Image Map.

Note that saving the output (large image) as a file is recommended, rather than viewing in RStudio directly. Alternatively, you can open a new window for graphing with x11().

cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA")

We can try making the heatmap more informative by:

  • adding a sidebar colour-coding the samples by experimental group
  • plotting the genes from components 1 and 2 separately
  • customising the legend
  • adjusting the size of labels/margins so that they fit (or not trying to display things that won’t fit)

legend = list(legend = levels(Y), # set of classes
            col = unique(color.mixo(Y)), # set of colours
            title = "Cancer Subtype", # legend title
            cex = 0.7) # legend size

cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp1", comp = 1, 
    row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE, 
    margins = c(5,2), title = "mRNA sPLSDA comp1", ylab = "Samples")
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp2", comp = 2,
    row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE,
    margins = c(5,2), title = "mRNA sPLSDA comp2", ylab = "Samples")

9 Predictions

sPLS-DA models can be useful for finding a good combination of distinctive features associated with the experimental groupings. These models can also be used to make predictions, i.e. to classify new samples.

# Prepare test set data
data.test.mRNA <- breast.TCGA$data.test$mrna
Y.test <- breast.TCGA$data.test$subtype
table(Y.test)
## Y.test
## Basal  Her2  LumA 
##    21    14    35
predict.mRNA <- predict(tuned.splsda.mRNA, newdata = data.test.mRNA, 
                               dist = "centroids.dist")

predict.mRNA # List the different outputs
## 
## Call:
##  predict.mixo_spls(object = tuned.splsda.mRNA, newdata = data.test.mRNA, dist = "centroids.dist") 
## 
##  Main numerical outputs: 
##  -------------------- 
##  Prediction values of the test samples for each component: see object$predict 
##  variates of the test samples: see object$variates 
##  Classification of the test samples for each distance and for each component: see object$class

  • The output of predict() includes prediction scores for each sample × group.
  • We can check the accuracy of the classifier with a confusion matrix.

head(predict.mRNA$predict)
## , , dim1
## 
##          Basal      Her2         LumA
## A54N 0.8241670 0.2646864 -0.088853486
## A2NL 0.9853394 0.2845764 -0.269915847
## A6VY 0.7385149 0.2541163  0.007368833
## A3XT 0.8312773 0.2655639 -0.096841200
## A1F0 0.6333344 0.2411362  0.125529447
## A26I 0.7314737 0.2532473  0.015278979
## 
## , , dim2
## 
##          Basal          Her2        LumA
## A54N 1.0494133 -0.1409544132  0.09154107
## A2NL 1.1428287  0.0009576082 -0.14378632
## A6VY 0.7884695  0.1641541848  0.04737632
## A3XT 1.0221914 -0.0782488158  0.05605746
## A1F0 0.8016601 -0.0619977372  0.26033763
## A26I 0.6551407  0.3907136338 -0.04585434
confusion.mat.mRNA <- get.confusion_matrix(truth = Y.test, 
                predicted = predict.mRNA$class$centroids.dist[,2])
confusion.mat.mRNA
##       predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal                 20                 1                 0
## Her2                   0                14                 0
## LumA                   0                 1                34
get.BER(confusion.mat.mRNA)
## [1] 0.02539683

  • The Balanced Error Rate (BER) suggests the accuracy is good (>97%)
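
For comparison, the overall accuracy can be read directly off the confusion matrix above:

```r
# overall accuracy: correctly classified test samples / total test samples
m <- as.matrix(confusion.mat.mRNA)
sum(diag(m)) / sum(m)
## [1] 0.9714286
```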

10 Practice

  • Perform supervised clustering (sPLSDA) based on the miRNA data layer, to optimally separate the three subtypes.
  • How about if we want to separate the Basal subtype from the other two subtypes, in a binary classification?
  • (optional) There is miRNA data for the independent test samples (breast.TCGA$data.test$mirna); how accurately can your model classify them?

11 Discussion and Recap

  • Unsupervised vs Supervised analysis
  • Feature Selection, Stability
  • Predictions
  • Strengths and weaknesses of PLSDA
    • identify features that separate classes
    • identify noise that separates classes
  • Recommendations: cross-validation, test independent data, interpret with caution!